Conversation

@ggerganov
Member

@ggerganov ggerganov commented Oct 23, 2025

ref #4130 (reply in thread)

Current logic in this PR (subject to change; a toy sketch of the eviction policy follows the example below):

  • When using unified KV cache with -kvu, share the entire context -c N among all parallel slots of the server -np N
  • When we run out of space, try to free some by purging old sequences from idle slots, one by one, in no particular order
  • If we still run out of space, terminate all active slots at once
  • The -np N argument still controls the maximum number of parallel jobs, but it no longer changes the per-slot context
  • By default, start the server using 4 slots and unified KV cache

Example:

llama-server -m model.gguf -c 8192 --jinja
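
For illustration, here is a small standalone C++ toy model of the eviction policy described above. The slot and sequence bookkeeping is made up for the example and is not the server's actual code:

#include <cstdio>
#include <vector>

// Toy model: all slots share one context budget; when space runs out, purge
// idle slots one by one, and only if that is still not enough, terminate the
// active slots. Names and structure are illustrative only.
struct Slot {
    int  n_cells = 0;    // KV cells currently held by this slot's sequence
    bool active  = false;
};

static bool try_reserve(std::vector<Slot> & slots, int n_ctx_total, int n_needed) {
    auto used = [&]() {
        int n = 0;
        for (const auto & s : slots) n += s.n_cells;
        return n;
    };

    // 1) purge idle slots one by one until the new tokens fit
    for (auto & s : slots) {
        if (used() + n_needed <= n_ctx_total) break;
        if (!s.active && s.n_cells > 0) {
            std::printf("purging idle slot holding %d cells\n", s.n_cells);
            s.n_cells = 0;
        }
    }

    // 2) if that is still not enough, terminate all active slots at once
    if (used() + n_needed > n_ctx_total) {
        for (auto & s : slots) {
            if (s.active) {
                std::printf("terminating active slot holding %d cells\n", s.n_cells);
                s.n_cells = 0;
                s.active  = false;
            }
        }
    }

    return used() + n_needed <= n_ctx_total;
}

int main() {
    std::vector<Slot> slots(4);
    slots[0] = {6000, true };   // one busy slot
    slots[1] = {1500, false};   // idle slot with a cached prompt

    const int n_ctx_total = 8192;

    std::printf("fits: %s\n", try_reserve(slots, n_ctx_total, 1024) ? "yes" : "no");
    return 0;
}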

TODO:

  • When we run out of space, terminate the active slots one-by-one and keep trying
  • Think about moving the slot into the host-memory cache instead of purging it. Not sure this is really needed, given the existing logic from server : host-memory prompt caching #16391
  • Add tests

Future improvements:

  • When we run out of space, terminate slots one by one instead of all at once
  • Update the logic for starting a new task to check that there is some extra room for generation (not sure if this is needed; the current logic will simply purge one of the other slots, so it should be good as it is)


uint32_t llama_context::n_ctx_per_seq() const {
-    return cparams.n_ctx / cparams.n_seq_max;
+    return cparams.kv_unified ? cparams.n_ctx : cparams.n_ctx / cparams.n_seq_max;
Member

Should this value be capped when using the unified cache to avoid exceeding the model context length? I think it could be set to min(n_ctx_train, n_ctx), or a parameter could be added to allow the user to change it.

Member Author

I guess we can cap it to n_ctx_train. The only use case for n_ctx > n_ctx_train that comes to mind is self-extend, but lately this technique seems less relevant.

We can also cap it for the non-unified case?

Suggested change
-    return cparams.kv_unified ? cparams.n_ctx : cparams.n_ctx / cparams.n_seq_max;
+    return std::min(n_ctx_train, cparams.kv_unified ? cparams.n_ctx : cparams.n_ctx / cparams.n_seq_max);

Member

We can also cap it for the non-unified case?

What would happen to the leftover slots? I may be misunderstanding the way split cache works, but my assumption would be that these slots would never be used, and it would be wasted memory. So if that's capped, it should be done at context creation.

Member Author

Right, we should do the capping at context creation in the llama_context constructor. Currently we have some additional logic for this in llama-model:

llama.cpp/src/llama-model.cpp

Lines 19708 to 19724 in 7863fcc

const auto padding = llama_kv_cache::get_padding(cparams);

uint32_t n_ctx_per_stream = cparams.n_ctx;

if (!cparams.kv_unified) {
    n_ctx_per_stream = (cparams.n_ctx + cparams.n_seq_max - 1)/cparams.n_seq_max;
    n_ctx_per_stream = GGML_PAD(n_ctx_per_stream, padding);

    cparams.n_ctx = n_ctx_per_stream*cparams.n_seq_max;
} else {
    n_ctx_per_stream = GGML_PAD(n_ctx_per_stream, padding);

    cparams.n_ctx = n_ctx_per_stream;
}

LLAMA_LOG_DEBUG("%s: n_ctx = %u (padded)\n", __func__, cparams.n_ctx);
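
For illustration (numbers assumed, not taken from the PR): with n_ctx = 8192, n_seq_max = 3 and a padding of 32, the non-unified branch gives n_ctx_per_stream = ceil(8192/3) = 2731, padded up to 2752, so cparams.n_ctx becomes 2752 * 3 = 8256. The unified branch would simply pad 8192 itself, which is already a multiple of 32.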

Since we no longer need the padding logic (as of #16148 and related), we should simplify this.

I'll push a separate PR for this and then will come back to polishing this one.

Member Author

This is now rebased on top of the changes in #16812. The result is that we determine the KV cache size during context creation and there should be no leftover KV cells.

Note that since we now cap the context size to the training context size, user code is recommended to query llama_n_ctx and llama_n_ctx_seq after creating the llama_context in order to obtain the actual context sizes. I'll add comments in llama.h to reflect this.
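
For example, a minimal usage sketch (assuming model has already been loaded, e.g. via llama_model_load_from_file; adjust to the exact declarations in llama.h):

// Sketch: query the effective sizes after creation, since they may have been
// adjusted relative to what was requested.
llama_context_params cparams = llama_context_default_params();
cparams.n_ctx      = 8192;
cparams.n_seq_max  = 4;
cparams.kv_unified = true;

llama_context * ctx = llama_init_from_model(model, cparams);

const uint32_t n_ctx     = llama_n_ctx(ctx);      // total context across sequences
const uint32_t n_ctx_seq = llama_n_ctx_seq(ctx);  // effective per-sequence context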

Will try to clean up this PR next and will open it for review when ready.

@github-actions github-actions bot added the python python script changes label Oct 23, 2025
@ggerganov ggerganov force-pushed the gg/server-unified-slots branch 4 times, most recently from 55bb9db to 6369fe0 Compare October 28, 2025 10:50
@github-actions github-actions bot added the testing Everything test related label Oct 28, 2025
@ggerganov ggerganov force-pushed the gg/server-unified-slots branch from 6369fe0 to ac261be Compare October 29, 2025 14:13
Comment on lines +139 to +140
if (cparams.n_ctx_seq > hparams.n_ctx_train) {
    LLAMA_LOG_WARN("%s: n_ctx_seq (%u) > n_ctx_train (%u) -- possible training context overflow\n",
            __func__, cparams.n_ctx_seq, hparams.n_ctx_train);
Member Author

This branch should not be reached due to the capping above on line 117. But keeping it in case the capping logic gets changed in the future.

@ggerganov ggerganov force-pushed the gg/server-unified-slots branch 2 times, most recently from 0ba88d3 to 4e9e319 Compare October 30, 2025 17:01
@ggerganov ggerganov marked this pull request as ready for review October 30, 2025 18:39
@ggerganov ggerganov requested review from CISC and ngxson as code owners October 30, 2025 18:39
@ggerganov
Member Author

Ready for review. I've marked some TODOs for follow-up PRs since the current implementation is quite basic, yet it gets us 90% of the way to the ideal logic. Will improve the rest of the cases from master.

@ggerganov ggerganov requested a review from slaren October 30, 2025 18:41
Comment on lines 115 to 120
if (cparams.kv_unified) {
    cparams.n_ctx_seq = cparams.n_ctx;
} else {
    cparams.n_ctx_seq = cparams.n_ctx / cparams.n_seq_max;
}

if (cparams.n_ctx_seq > hparams.n_ctx_train) {
    LLAMA_LOG_WARN("%s: capping n_ctx_seq (%u) to n_ctx_train (%u)\n", __func__, cparams.n_ctx_seq, hparams.n_ctx_train);

    cparams.n_ctx_seq = hparams.n_ctx_train;
}

if (cparams.kv_unified) {
    cparams.n_ctx = cparams.n_ctx_seq;
} else {
    cparams.n_ctx = cparams.n_ctx_seq * cparams.n_seq_max;
}
Member
@slaren slaren Nov 1, 2025

I am not completely convinced about this; I think it may create confusion and add complexity to applications. The server and other applications using the unified cache need a sequence-length limit independent of n_ctx, but that should probably be a different parameter that defaults to min(n_ctx, n_ctx_train). This would be an application parameter, not part of the llama.cpp API.

Member Author

Sounds good. Just remove the capping here?

Member

Yes, I think it would be preferable to not have a limit here. The user should be able to override the model n_ctx_train, and it is easier to do it this way than with a KV override.

Member Author

Moved the capping logic to the llama-server.
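
For reference, the application-side cap could look roughly like this (a sketch with an assumed variable n_ctx_slot standing in for the server's per-slot context; not the exact server code):

// Sketch: cap the per-slot context to the model's training context on the
// application side, now that libllama no longer enforces the cap itself.
const int32_t n_ctx_train = llama_model_n_ctx_train(model);

if (n_ctx_slot > n_ctx_train) {
    LOG_WRN("slot context (%d) exceeds training context (%d), capping\n", n_ctx_slot, n_ctx_train);
    n_ctx_slot = n_ctx_train;
}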

Comment on lines +4435 to +4448
if (params.n_parallel == 1 && params.kv_unified == false) {
    LOG_WRN("%s: setting n_parallel = 4 and kv_unified = true\n", __func__);

    params.n_parallel = 4;
    params.kv_unified = true;
}
Collaborator

Is there a reason why this can't be the default params in arg.h?

Member Author

I'll see if I can make it the default - I thought that some of the examples might not like it.

Collaborator

Hmm, yeah, I didn't notice that there are multiple examples all using n_parallel.

In this case, maybe we can use a dedicated variable for the server, like params.n_parallel_server?

This can be useful when auto-generating the documentation for the server args.
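
For illustration only, a rough idea of what such a dedicated field could look like (the struct and field names here are hypothetical, not part of the codebase):

// Hypothetical sketch: a server-specific slot count so the other examples can
// keep using n_parallel unchanged.
struct common_params_hypothetical {
    // ... existing fields ...
    int32_t n_parallel        = 1; // used by the other examples
    int32_t n_parallel_server = 4; // used only by llama-server
};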

    n_batch /= 2;
}

SRV_WRN("failed to find free space in the KV cache, retrying with smaller batch size, i = %d, n_batch = %d, ret = %d\n", i, n_batch, ret);
Collaborator

This warning should be moved inside the if condition above, right?

Collaborator

Also, maybe I forgot this from an earlier discussion, but in which case do we currently need to retry with a smaller batch size?

Member Author
@ggerganov ggerganov Nov 1, 2025

The main case for retrying with smaller batches was back when we didn't have ggml_set_rows and we always had to search for a contiguous set of cells (KV slots) inside the cache buffer to place the input batch. Now with ggml_set_rows this is no longer needed and, technically, retrying with a smaller batch size has almost no purpose except in some rare cases.

But generally, when llama_decode returns 1, you should retry with a smaller batch.
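
As an illustration of that pattern, here is a caller-side sketch (the helper name decode_tokens is made up and this is not the server's actual implementation): feed the tokens in chunks of n_batch and halve n_batch whenever llama_decode returns 1.

#include <algorithm>
#include <cstddef>
#include <vector>

#include "llama.h"

static bool decode_tokens(llama_context * ctx, const std::vector<llama_token> & tokens) {
    int32_t n_batch = (int32_t) llama_n_batch(ctx);

    for (size_t i = 0; i < tokens.size(); ) {
        const int32_t n_eval = std::min<int32_t>(n_batch, (int32_t) (tokens.size() - i));

        llama_batch batch = llama_batch_get_one(const_cast<llama_token *>(tokens.data() + i), n_eval);

        const int32_t ret = llama_decode(ctx, batch);
        if (ret == 0) {
            i += n_eval;      // chunk accepted, move to the next one
        } else if (ret == 1 && n_batch > 1) {
            n_batch /= 2;     // no space found in the KV cache: retry this chunk with a smaller batch
        } else {
            return false;     // hard error, or cannot shrink the batch any further
        }
    }

    return true;
}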

@ngxson
Collaborator

ngxson commented Nov 1, 2025

If we still run out of space, terminate all active slots at once

Hmm, this could be a bit of a bummer in terms of UX. For example, consider this case:

  • A user starts a slot and generates text, with the slot using almost all of the context size
  • At the same time (while the first slot is still generating text), the user submits a new request, which starts a second slot
  • Now both slots compete with each other, which eventually causes the second slot to be terminated too early

An idea for improvement could be to only allow starting a task when the remaining context passes a threshold (maybe more than half free?), otherwise defer the task. (Of course, we can implement this in a follow-up PR.)

@ggerganov
Member Author

Yes, there are several edge cases that can be handled better. This specific case can probably be handled even better: when the total context gets filled up, move one active sequence to the host-memory cache and resume it later when another sequence finishes. This way we don't need special thresholds and both sequences would eventually finish (after a short pause for one of them, of course).

@ggerganov ggerganov force-pushed the gg/server-unified-slots branch from 93373cc to c08d0d1 Compare November 1, 2025 15:45
if (cparams.kv_unified) {
    cparams.n_ctx_seq = cparams.n_ctx;
} else {
    cparams.n_ctx_seq = cparams.n_ctx / cparams.n_seq_max;
Member

Maybe an error could be returned here if n_ctx is not a multiple of n_seq_max, since that's likely to be a mistake.

Member Author
@ggerganov ggerganov Nov 2, 2025

I added a warning. The problem I see with throwing an error is that the user might often want to split the default training context among, for example, 3 sequences. In the majority of cases the training context (typically a power of 2) would not be divisible by 3, which would result in an error.
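
For example (numbers for illustration only): with n_ctx = 8192 and n_seq_max = 3, integer division gives n_ctx_seq = 2730 and only 2 of the requested cells go unused; a hard error here would reject a perfectly reasonable configuration, which is why a warning fits the common case better.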

@ggerganov ggerganov merged commit cd5e3b5 into master Nov 2, 2025
68 of 74 checks passed
@ggerganov ggerganov deleted the gg/server-unified-slots branch November 2, 2025 16:14
@EverchangerL

EverchangerL commented Nov 2, 2025

Hi, by default there is a unified KV cache with 4 slots, but setting only "--parallel 1" still uses 4 slots with the unified KV cache.
However, it uses 1 slot with "--kv-unified --parallel 1", or simply "--kv-unified".

Is it correct that "--parallel 1" doesn't work as expected (it should use 1 slot) without "--kv-unified"?

full command: llama-server.exe --model Prototype-X-12B-Q4_K_S.gguf --host 127.0.0.1 --port 5001 --ctx-size 16384 --gpu-layers 37 --threads 4 --threads-batch 6 --batch-size 256 --cache-type-k q4_0 --cache-type-v q4_0 --no-mmap --mlock --no-webui --cache-ram -1 --parallel 1
